Abstract

In this paper, we study statistical properties of semi-supervised learning, which is considered
an important problem in the machine learning community. In standard supervised learning, only
labeled data is observed, and classification and regression problems are formalized as supervised
learning. In semi-supervised learning, unlabeled data is also obtained in addition to labeled data.
Hence, exploiting unlabeled data is important for improving the prediction accuracy in
semi-supervised learning. This problem is regarded as a semiparametric estimation problem with
missing data. Under discriminative probabilistic models, it had been considered that unlabeled
data is useless for improving the estimation accuracy. Recently, it was revealed that the weighted
estimator using the unlabeled data achieves better prediction accuracy than the learning method
using only labeled data, especially when the discriminative probabilistic model is misspecified.
That is, improvement under the semiparametric model with missing data is possible when the
semiparametric model is misspecified. In this paper, we apply the density-ratio estimator to
obtain the weight function in semi-supervised learning. The benefit of our approach is that the
proposed estimator does not require well-specified probabilistic models for the probability of the
unlabeled data. Based on statistical asymptotic theory, we prove that the estimation accuracy of
our method outperforms supervised learning using only labeled data. Numerical experiments
demonstrate the usefulness of our methods.

1 Introduction 

In this paper, we analyze statistical properties of semi-supervised learning. In standard
supervised learning, only the labeled data $(x, y)$ is observed, and the goal is to estimate the
relation between $x$ and $y$. In semi-supervised learning (Chapelle et al., 2006), the unlabeled
data $x'$ is also obtained in addition to the labeled data. In real-world data such as text data,
we can often obtain both labeled and unlabeled data. A typical example is that $x$ and $y$ stand
for the text of an article and the tag of the article, respectively. Tagging articles demands
a lot of effort. Hence, labeled data is scarce, while unlabeled data is abundant. In
semi-supervised learning, studying methods of exploiting unlabeled data is an important issue.

In standard semi-supervised learning, statistical models of the joint probability $p(x, y)$,
i.e., generative models, are often used to incorporate the information in the unlabeled
data into the estimation. For example, under the statistical model $p(x, y; \beta)$ with the
parameter $\beta$, the information in the unlabeled data is used to estimate $\beta$
via the marginal probability $p(x; \beta) = \int p(x, y; \beta)\,dy$. The amount of information in
unlabeled samples is studied in (Castelli & Cover, 1996; Dillon et al., 2010; Sinha & Belkin, 2007).
This approach has been extended to deal with various data structures. For example, semi-supervised
learning with the manifold assumption or the cluster assumption has been studied along this line
(Belkin & Niyogi, 2004; Lafferty & Wasserman, 2007). Under some assumptions on generative models,
it has been shown that unlabeled data is useful for improving the prediction accuracy.

Statistical models of the conditional probability $p(y|x)$, i.e., discriminative models, are also
used in semi-supervised learning. It seems that the unlabeled data is not very useful
for the estimation of the conditional probability, since the marginal probability does not carry
any information on $p(y|x)$ (Lasserre et al., 2006; Seeger, 2001; Zhang & Oles, 2000). Indeed,
the maximum likelihood estimator using a parametric model of $p(y|x)$ is not affected by the
unlabeled data. Sokolovska et al. (2008), however, proved that even under
discriminative models, unlabeled data is still useful for improving the prediction accuracy of the
learning method with only labeled data.

Semi-supervised learning methods basically work well under some assumptions on the population
distribution and the statistical models. However, it was also reported that semi-supervised
learning can degrade the estimation accuracy, especially when a misspecified model is applied
(Cozman et al., 2003; Grandvalet & Bengio, 2005; Nigam et al., 1999). Hence, safe semi-supervised
learning is desired. The learning algorithms proposed by Sokolovska et al. (2008) and
Li & Zhou (2011) have a theoretical guarantee that the unlabeled data does not degrade the
estimation accuracy.

In this paper, we develop the study of Sokolovska et al. (2008). To incorporate the information
in unlabeled data into the estimator, Sokolovska et al. (2008) used the weighted estimator.
In the estimation of the weight function, a well-specified model
for the marginal probability $p(x)$ was assumed. This is a strong assumption for semi-supervised
learning. To overcome this drawback, we apply the density-ratio estimator for the estimation
of the weight function (Sugiyama & Kawanabe, 2012; Sugiyama et al., 2012). We prove that
semi-supervised learning with density-ratio estimation improves on standard supervised
learning. Our method is applicable not only to classification problems but also to regression
problems, while many semi-supervised learning methods focus on binary classification problems.

This paper is organized as follows. In Section 2, we show the problem setup. In Section
3, we introduce the weighted estimator investigated by Sokolovska et al. (2008).
In Section 4, we briefly explain density-ratio estimation. In Section 5, the asymptotic
variance of the estimators under consideration is studied. Section 6 is devoted to proving that the
weighted estimator using labeled and unlabeled data outperforms supervised learning using
only labeled data. In Section 7, numerical experiments are presented. We conclude in Section 8.

2 Problem Setup 

We introduce the problem setup. We suppose that the probability distribution of training
samples is given as

    $(x_i, y_i) \sim_{\mathrm{i.i.d.}} p(y|x)\,p(x),\ i = 1, \dots, n, \qquad x'_j \sim_{\mathrm{i.i.d.}} q(x),\ j = 1, \dots, n'$,    (1)

where $p(y|x)$ is the conditional probability of $y \in \mathcal{Y}$ given $x \in \mathcal{X}$, and
$p(x)$ and $q(x)$ are the marginal probabilities on $\mathcal{X}$. Here, $q(x)$ is regarded as the
probability in the testing phase, i.e., the test data $(x, y)$ is distributed according to the joint
probability $p(y|x)q(x)$, and the estimation accuracy is evaluated under the test probability.
The paired sample $(x_i, y_i)$ is called "labeled data", and the unpaired sample $x'_j$ is called
"unlabeled data". Our goal is to estimate the conditional probability $p(y|x)$ or the conditional
expectation $E[y|x]$ based on the labeled and unlabeled data in (1). When $\mathcal{Y}$ is a finite
set, the problem is called the classification problem. For $\mathcal{Y} = \mathbb{R}$, the
estimation of $E[y|x]$ is referred to as the regression problem.

We describe the assumption on the marginal distributions, $p(x)$ and $q(x)$ in (1). In the
context of covariate shift adaptation (Shimodaira, 2000), the assumption that $p(x) \neq q(x)$
is employed in general. The weighted estimator with the weight function $q(x)/p(x)$ is used to
correct the estimation bias induced by the covariate shift; see (Sugiyama & Kawanabe, 2012;
Sugiyama et al., 2012) for details. Hence, the estimation of the weight function $q(x)/p(x)$ is
important for achieving good estimation accuracy. On the other hand, in semi-supervised
learning (Chapelle et al., 2006), the equality $p(x) = q(x)$ is assumed, and often $n'$ is much larger
than $n$. This setup is also quite practical. For example, in text data mining, the labeled
data is scarce, while the unlabeled data is abundant. In this paper, we assume that the equality

    $p(x) = q(x)$    (2)

holds.

We define the following semiparametric model,

    $\mathcal{M} = \{\, p(y|x; \alpha)\, r(x) : \alpha \in A \subset \mathbb{R}^d,\ r \in \mathcal{P} \,\}$    (3)

for the estimation of the conditional probability $p(y|x)$, where $\mathcal{P}$ is the set of all
probability densities of the covariate $x$. The parameter of interest is $\alpha$, and
$r(x) \in \mathcal{P}$ is the nuisance parameter. The model $\mathcal{M}$ does not necessarily include
the true probability $p(y|x)q(x)$, i.e., there may not exist a parameter $\alpha$ such that
$p(y|x) = p(y|x; \alpha)$ holds. This condition is significant when we consider the improvement of
the inference with the labeled and unlabeled data. Our target is to estimate the parameter
$\alpha^*$ satisfying

    $\max_{\alpha \in A} E[\log p(y|x; \alpha)] = E[\log p(y|x; \alpha^*)]$,    (4)

in which $E[\cdot]$ denotes the expectation with respect to the population distribution. If the model
$\mathcal{M}$ includes the true probability, we have $p(y|x; \alpha^*) = p(y|x)$ due to the
non-negativity of the Kullback-Leibler divergence (Cover & Thomas, 2006). In the misspecified setup,
however, the equality $p(y|x; \alpha^*) = p(y|x)$ is not guaranteed.

3 Weighted Estimator in Semi-supervised Learning 

We introduce the weighted estimator. For the estimation of $p(y|x)$ under the model (3), we
consider the maximum likelihood estimator (MLE). For the statistical model $p(y|x; \alpha)$, let
$u(x, y; \alpha) \in \mathbb{R}^d$ be the score function

    $u(x, y; \alpha) = \nabla \log p(y|x; \alpha)$,

where $\nabla$ denotes the gradient with respect to the model parameter. Then, for any
$\alpha \in A$, we have

    $\int u(x, y; \alpha)\, p(y|x; \alpha)\, p(x)\,dx\,dy = 0$.

In addition, the extremal condition of (4) leads to

    $\int u(x, y; \alpha^*)\, p(y|x)\, p(x)\,dx\,dy = 0$.

Hence, we can estimate the conditional density $p(y|x)$ by $p(y|x; \tilde\alpha)$, where
$\tilde\alpha$ is a solution of the estimation equation

    $\frac{1}{n} \sum_{i=1}^{n} u(x_i, y_i; \alpha) = 0$.    (5)

Under the regularity condition, the MLE is statistically consistent for the parameter $\alpha^*$ in
(4); see (van der Vaart, 1998) for details. In addition, the score function $u$ is an optimal choice
among Z-estimators (van der Vaart, 1998) when the true probability density is included in the
model $\mathcal{M}$. This implies that the efficient score of the semiparametric model $\mathcal{M}$
is the same as the score function of the model $p(y|x; \alpha)$. This is because, in the
semiparametric model $\mathcal{M}$, the tangent space of the parameter of interest is orthogonal to
that of the nuisance parameter. Here, the asymptotic variance matrix of the estimated parameter is
employed to compare the estimation accuracy.

Next, we consider the setup of semi-supervised learning. When the model $\mathcal{M}$ is correctly
specified, we find that the estimator (5) using only the labeled data is efficient. This follows
from the results of numerous studies on semiparametric inference with missing data; see (Nan
et al., 2009; Robins et al., 1994) and references therein.

Suppose that the model $\mathcal{M}$ is misspecified. Then, it is possible to improve the MLE in (5)
by using the weighted MLE (Sokolovska et al., 2008). The weighted MLE is defined as a solution
of the equation

    $\frac{1}{n} \sum_{i=1}^{n} w(x_i)\, u(x_i, y_i; \alpha) = 0$,    (6)

where $w(x)$ is a weight function. Suppose that $w(x) = q(x)/p(x)$. Then the law of large numbers
leads to the convergence in probability,

    $\frac{1}{n} \sum_{i=1}^{n} w(x_i)\, u(x_i, y_i; \alpha) \xrightarrow{p} \int \frac{q(x)}{p(x)}\, u(x, y; \alpha)\, p(y|x)\, p(x)\,dx\,dy = \int u(x, y; \alpha)\, p(y|x)\, q(x)\,dx\,dy$.

Hence the estimator $p(y|x; \hat\alpha)$ based on (6) will provide a good estimator of $p(y|x)$
under the marginal probability $q(x)$. This indicates that $p(y|x; \hat\alpha)$ is expected to
approximate $p(y|x)$ over the region on which $q(x)$ is large. The weight function $w(x)$ adjusts
the bias of the estimator under the covariate shift (Shimodaira, 2000). In the setup of
semi-supervised learning, however, $w(x) = q(x)/p(x) = 1$ holds, and it is known beforehand. Hence,
one may think that there is no need to estimate the weight function. Sokolovska et al. (2008)
showed that estimation of the weight function is useful, even though it is already known
in semi-supervised learning.
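
The weighted estimating equation (6) can be sketched for the simplest case. The following is an illustrative example only, under assumptions not made in the text: a scalar linear model $y = \alpha x + z$ with score $u(x, y; \alpha) = (y - \alpha x)x$, for which (6) has a closed-form solution. The function name `weighted_score_solution` is hypothetical.

```python
def weighted_score_solution(xs, ys, ws):
    """Solve (1/n) * sum_i w(x_i) * (y_i - alpha * x_i) * x_i = 0 for scalar alpha.

    Assumed setting (not from the paper): a one-dimensional linear model,
    so the weighted estimating equation reduces to weighted least squares.
    """
    num = sum(w * x * y for x, y, w in zip(xs, ys, ws))
    den = sum(w * x * x for x, w in zip(xs, ws))
    return num / den


# A misspecified case: the data follow y = x^2, but the model is linear.
xs = [1.0, 2.0, 3.0]
ys = [1.0, 4.0, 9.0]
alpha_unweighted = weighted_score_solution(xs, ys, [1.0, 1.0, 1.0])
alpha_reweighted = weighted_score_solution(xs, ys, [0.5, 1.0, 1.5])
```

Under misspecification the solution depends on the weights, which is why the choice (and estimation) of $w(x)$ matters; if the linear model were exact, every positive weight function would give the same $\alpha$.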

We briefly introduce the result in (Sokolovska et al., 2008). Let the set $\mathcal{X}$ be finite.
Then, $\mathcal{P}$ is a finite-dimensional parametric model. Suppose that the sample size of the
unlabeled data is enormous, and that the probability function $q(x)$ on $\mathcal{X}$ is known with
a high degree of accuracy. The probability $p(x)$ is estimated by the maximum likelihood estimator
$\hat p(x)$ based on the samples $\{x_i\}_{i=1}^{n}$ in the labeled data. Then, Sokolovska et al.
(2008) showed that the weighted MLE (6) with the estimated weight function
$\hat w(x) = q(x)/\hat p(x)$ improves the naive MLE when the model $\mathcal{M}$ is misspecified,
i.e., $p(y|x)q(x) \notin \mathcal{M}$.

Shimodaira (2000) pointed out that the weighted MLE using the exact density
ratio $w(x) = q(x)/p(x)$ is statistically consistent for the target parameter $\alpha^*$ when
covariate shift occurs. Under the regularity condition, it is rather straightforward to see that
the weighted MLE using the estimated weight function $\hat w(x) = q(x)/\hat p(x)$ also converges to
$\alpha^*$ in probability, since $\hat p(x)$ converges to $p(x)$ in probability. Sokolovska's result
implies that when $p(x) = q(x)$ holds, the weighted MLE using the estimated weight function improves
the weighted MLE using the true density ratio in the sense of the asymptotic variance of the
estimator.

The phenomenon above is similar to the statistical paradox analyzed in (Henmi & Eguchi,
2004; Henmi et al., 2007). In semiparametric estimation, Henmi & Eguchi (2004) pointed out that
the estimation accuracy of the parameter of interest can be
improved by estimating the nuisance parameter, even when the nuisance parameter is known
beforehand. Hirano et al. (2003) also pointed out that the estimator with the
estimated propensity score is more efficient than the estimator using the true propensity score
in the estimation of average treatment effects. Here, the propensity score corresponds to
the weight function $w(x)$ in our context. The degree of improvement is described by using the
projection of the score function onto the subspace defined by the efficient score for the
semiparametric model. In our analysis, the projection of the score function $u(x, y; \alpha)$ also
plays an important role, as shown in Section 6.

For the estimation of the weight function in (6), we apply the density-ratio estimator
(Sugiyama & Kawanabe, 2012; Sugiyama et al., 2012) instead of estimating the probability
densities separately. We show that the density-ratio estimator provides a practical method for
semi-supervised learning. In the next section, we introduce density-ratio estimation.



4 Density-ratio estimation 

Density-ratio estimators can be used to estimate the weight function $w(x) = q(x)/p(x)$.
Recently, methods for the direct estimation of density ratios have been developed in the machine
learning community (Sugiyama & Kawanabe, 2012; Sugiyama et al., 2012). We apply the
density-ratio estimator to estimate the weight function $w(x)$ instead of using an estimator of
each probability density.

We briefly introduce the density-ratio estimator according to (Qin, 1998). Suppose that the
following training samples are observed,

    $x_i \sim_{\mathrm{i.i.d.}} p(x),\ i = 1, \dots, n, \qquad x'_j \sim_{\mathrm{i.i.d.}} q(x),\ j = 1, \dots, n'$.    (7)

Our goal is to estimate the density ratio $w(x) = q(x)/p(x)$. The $r$-dimensional parametric model
for the density ratio is defined by

    $w(x; \theta) = \exp\{\theta_1 \phi_1(x) + \cdots + \theta_r \phi_r(x)\}$,    (8)

where $\phi_1(x) = 1$ is assumed. For any function $\eta(x; \theta) \in \mathbb{R}^r$ which may
depend on the parameter $\theta$, one has the equality

    $\int \eta(x; \theta)\, w(x)\, p(x)\,dx - \int \eta(x; \theta)\, q(x)\,dx = 0$.

Hence, the empirical approximation of the above equation is expected to provide an estimation
equation of the density ratio. The empirical approximation of the above equality under the
parametric model $w(x; \theta)$ is given as

    $\frac{1}{n} \sum_{i=1}^{n} \eta(x_i; \theta)\, w(x_i; \theta) - \frac{1}{n'} \sum_{j=1}^{n'} \eta(x'_j; \theta) = 0$.    (9)

Let $\hat\theta$ be a solution of (9); then $w(x; \hat\theta)$ is an estimator of $w(x)$. Note that
we do not need to estimate the probability densities $p(x)$ and $q(x)$ separately. The estimation
equation (9) provides a direct estimator of the density ratio based on moment matching with the
function $\eta(x; \theta)$. Qin (1998) proved that the optimal choice of $\eta(x; \theta)$ is given as

    $\eta(x; \theta) = \frac{1}{1 + n'/n \cdot w(x; \theta)}\, \nabla \log w(x; \theta) = \frac{1}{1 + n'/n \cdot w(x; \theta)}\, \phi(x)$,    (10)

where $\phi(x) = (\phi_1(x), \dots, \phi_r(x))^T$. By using $\eta(x; \theta)$ above, the asymptotic
variance matrix of $\hat\theta$ is minimized among the set of moment matching estimators when $w(x)$
is realized by the model $w(x; \theta)$. Hence, (10) is regarded as the counterpart of the score
function for parametric probability models.
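
The moment-matching equation (9) with the moment function (10) can be solved numerically. The sketch below is a minimal illustration under assumptions that are not part of the text: one-dimensional covariates, the two-dimensional basis $\phi(x) = (1, x)^T$, and a plain Newton iteration started at $\theta = 0$. The function names are hypothetical.

```python
import math


def phi(x):
    # Assumed basis functions phi(x) = (1, x); phi_1(x) = 1 as required by (8).
    return (1.0, x)


def ratio(x, theta):
    # Density-ratio model w(x; theta) = exp(theta^T phi(x)).
    f = phi(x)
    return math.exp(theta[0] * f[0] + theta[1] * f[1])


def moment_equation(theta, xs_p, xs_q):
    """Left-hand side of (9) with the moment function (10), and its Jacobian."""
    n, m = len(xs_p), len(xs_q)
    rho = m / n  # n'/n
    g = [0.0, 0.0]
    jac = [[0.0, 0.0], [0.0, 0.0]]
    for x in xs_p:  # samples from p(x)
        f, w = phi(x), ratio(x, theta)
        c = 1.0 + rho * w
        for a in range(2):
            g[a] += f[a] * w / c / n
            for b in range(2):
                jac[a][b] += f[a] * f[b] * w / (c * c) / n
    for x in xs_q:  # samples from q(x)
        f, w = phi(x), ratio(x, theta)
        c = 1.0 + rho * w
        for a in range(2):
            g[a] -= f[a] / c / m
            for b in range(2):
                jac[a][b] += f[a] * f[b] * rho * w / (c * c) / m
    return g, jac


def fit_density_ratio(xs_p, xs_q, max_iter=50, tol=1e-12):
    """Newton iteration for theta; a sketch without step-size control."""
    theta = [0.0, 0.0]
    for _ in range(max_iter):
        g, jac = moment_equation(theta, xs_p, xs_q)
        if abs(g[0]) + abs(g[1]) < tol:
            break
        det = jac[0][0] * jac[1][1] - jac[0][1] * jac[1][0]
        # theta <- theta - J^{-1} g, with the 2x2 inverse written out explicitly.
        theta[0] -= (jac[1][1] * g[0] - jac[0][1] * g[1]) / det
        theta[1] -= (-jac[1][0] * g[0] + jac[0][0] * g[1]) / det
    return theta


grid = [-2.0 + 0.2 * k for k in range(21)]
theta_same = fit_density_ratio(grid, grid)                     # p-sample equals q-sample
theta_shift = fit_density_ratio(grid, [x + 0.5 for x in grid])  # q shifted to the right
```

When both samples coincide, $\theta = 0$ already satisfies (9), so the fitted ratio is identically 1. When the second sample is shifted to the right, the fitted slope $\theta_2$ is positive, i.e., the estimated ratio increases in $x$.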

5 Semi- Supervised Learning with Density-Ratio Estimation 

We study the asymptotics of the weighted MLE (6) using the estimated density ratio. The
estimation equation is given as

    $\frac{1}{n} \sum_{i=1}^{n} w(x_i; \theta)\, u(x_i, y_i; \alpha) = 0$,
    $\frac{1}{n} \sum_{i=1}^{n} \eta(x_i; \theta)\, w(x_i; \theta) - \frac{1}{n'} \sum_{j=1}^{n'} \eta(x'_j; \theta) = 0$.    (11)

Here, the statistical models (3) and (8) are employed. The first equation is used for the estimation
of the parameter $\alpha$ of the model $p(y|x; \alpha)$, and the second equation is used for the
estimation of the density ratio $w(x; \theta)$. The estimator defined by (11) is referred to as
density-ratio estimation based on semi-supervised learning, or DRESS for short.
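
The two-step structure of (11) can be sketched as follows. This is an illustration under assumptions not made in the text: scalar covariates, the linear model $y = \alpha x + z$ with score $(y - \alpha x)x$, the basis $\phi(x) = (1, x)^T$, and the simple moment function $\eta(x; \theta) = \phi(x)$ in place of (10). The names `fit_ratio_moment_matching` and `dress_1d` are hypothetical.

```python
import math


def fit_ratio_moment_matching(xs_lab, xs_unl, max_iter=50, tol=1e-12):
    """Second equation of (11) with eta = phi = (1, x), solved by Newton's method.

    Finds theta such that (1/n) sum_i phi(x_i) w(x_i; theta) = (1/n') sum_j phi(x'_j).
    """
    n, m = len(xs_lab), len(xs_unl)
    target_mean = sum(xs_unl) / m
    theta = [0.0, 0.0]
    for _ in range(max_iter):
        ws = [math.exp(theta[0] + theta[1] * x) for x in xs_lab]
        s0 = sum(ws) / n
        s1 = sum(w * x for w, x in zip(ws, xs_lab)) / n
        s2 = sum(w * x * x for w, x in zip(ws, xs_lab)) / n
        g0, g1 = s0 - 1.0, s1 - target_mean
        if abs(g0) + abs(g1) < tol:
            break
        # The Jacobian is [[s0, s1], [s1, s2]]; invert the 2x2 matrix explicitly.
        det = s0 * s2 - s1 * s1
        theta[0] -= (s2 * g0 - s1 * g1) / det
        theta[1] -= (-s1 * g0 + s0 * g1) / det
    return theta


def dress_1d(xs_lab, ys_lab, xs_unl):
    """First equation of (11): weighted least squares with the estimated ratio."""
    theta = fit_ratio_moment_matching(xs_lab, xs_unl)
    ws = [math.exp(theta[0] + theta[1] * x) for x in xs_lab]
    num = sum(w * x * y for w, x, y in zip(ws, xs_lab, ys_lab))
    den = sum(w * x * x for w, x in zip(ws, xs_lab))
    return num / den, theta


xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [x + 0.5 * x * x for x in xs]           # misspecified for a linear model
alpha_hat, theta_hat = dress_1d(xs, ys, [-1.0, 0.0, 1.0, 2.0, 3.0])
```

Even though the true ratio is 1 in semi-supervised learning, the estimated weights differ slightly from 1 because the empirical moments of the labeled and unlabeled covariates differ; this sample-dependent reweighting is what the asymptotic analysis quantifies.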

In Sokolovska et al. (2008), the marginal probability density $p(x)$ is estimated
by using a well-specified parametric model. Clearly, preparing a well-specified parametric
model is not practical when $\mathcal{X}$ is not a finite set. On the other hand, it is easy to
prepare a correctly specified model of the density ratio $w(x)$ whenever $p(x) = q(x)$ holds in (1).
The model (8) is an example. Indeed, $w(x; 0) = 1$ holds. Hence, the assumption that the true weight
function is realized by the model $w(x; \theta)$ is not an obstacle in semi-supervised learning.

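We show the asymptotic expansion of the estimation equation (11). Let $\hat\alpha$ and
$\hat\theta$ be a solution of (11). In addition, let $\alpha^*$ be a solution of

    $\int u(x, y; \alpha)\, p(y|x)\, p(x)\,dx\,dy = 0$,

and $\theta^*$ be the parameter such that $w(x; \theta^*) = 1$, i.e., $\theta^* = 0$. We prepare some
notation: $u = u(x, y; \alpha^*)$, $\eta = \eta(x; \theta^*)$, $u_i = u(x_i, y_i; \alpha^*)$,
$\eta_i = \eta(x_i; \theta^*)$, $\eta'_j = \eta(x'_j; \theta^*)$,
$\delta\hat\alpha = \hat\alpha - \alpha^*$, $\delta\hat\theta = \hat\theta - \theta^*$. The Jacobian
of the score function $u$ with respect to the parameter $\alpha$ is denoted as $\nabla u$, i.e., the
$d$ by $d$ matrix whose elements are given as
$(\nabla u(x, y; \alpha))_{ik} = \frac{\partial^2}{\partial\alpha_i \partial\alpha_k} \log p(y|x; \alpha)$.
The variance matrix and the covariance matrix under the probability $p(y|x)p(x)$ are denoted as
$V[\cdot]$ and $\mathrm{Cov}[\cdot, \cdot]$, respectively. Without loss of generality, we assume
that $\eta$ at $\theta = \theta^*$ is represented as

    $\eta(x; \theta^*) = \phi(x) + \bar\phi(x)$,

where $\bar\phi$ is an arbitrary function orthogonal to $\phi$, i.e., $E[\phi \bar\phi^T] = O$
holds. If $\eta(x; \theta^*)$ does not have any component which is represented as a linear
transformation of $\phi(x)$, the estimator would be degenerate. Under the regularity condition, the
estimated parameters, $\hat\alpha$ and $\hat\theta$, converge to $\alpha^*$ and $\theta^*$,
respectively. The asymptotic expansion of (11) around $(\alpha, \theta) = (\alpha^*, \theta^*)$
leads to

    $E[\nabla u]\,\delta\hat\alpha + E[u \phi^T]\,\delta\hat\theta = -\frac{1}{n} \sum_{i=1}^{n} u_i + o_p(n^{-1/2})$,
    $E[\phi \phi^T]\,\delta\hat\theta = \frac{1}{n'} \sum_{j=1}^{n'} \eta'_j - \frac{1}{n} \sum_{i=1}^{n} \eta_i + o_p(n^{-1/2})$.

Hence, we have

    $E[\nabla u]\,\delta\hat\alpha = \frac{1}{n} \sum_{i=1}^{n} \{ E[u \phi^T] E[\phi \phi^T]^{-1} \eta_i - u_i \} - \frac{1}{n'} \sum_{j=1}^{n'} E[u \phi^T] E[\phi \phi^T]^{-1} \eta'_j + o_p(n^{-1/2})$.

Therefore, we obtain the asymptotic variance,

    $n \cdot E[\nabla u]\, V[\delta\hat\alpha]\, E[\nabla u]^T$
      $= V[u] + \left(1 + \frac{n}{n'}\right) E[u \phi^T] E[\phi \phi^T]^{-1} V[\eta] E[\phi \phi^T]^{-1} E[\phi u^T]$
      $\quad - E[u \phi^T] E[\phi \phi^T]^{-1} \mathrm{Cov}[\eta, u] - \mathrm{Cov}[u, \eta] E[\phi \phi^T]^{-1} E[\phi u^T] + o(1)$.

On the other hand, the variance of the naive MLE, $\tilde\alpha$, defined as a solution of (5) is
given as

    $n \cdot E[\nabla u]\, V[\delta\tilde\alpha]\, E[\nabla u]^T = V[u] + o(1)$,

where $\delta\tilde\alpha = \tilde\alpha - \alpha^*$.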

6 Maximum Improvement by Semi- Supervised Learning 

Given the model for the density ratio $w(x; \theta)$, we compare the asymptotic variance matrices of
the estimators $\tilde\alpha$ and $\hat\alpha$. First, let us define

    $\bar u(x) = \int u(x, y; \alpha^*)\, p(y|x)\,dy$,

i.e., $\bar u(x)$ is the projection of the score function $u(x, y; \alpha^*)$ onto the subspace
consisting of all functions depending only on $x$, where the inner product is defined by the
expectation under the joint probability $p(y|x)p(x)$. Note that the equality $E[\bar u] = 0$ holds.
Let the matrix $B$ be

    $B = E[u \phi^T]\, E[\phi \phi^T]^{-1}$.

Then, a simple calculation yields that the difference of the variance matrix between $\tilde\alpha$
and $\hat\alpha$ is equal to

    $\mathrm{Diff}[\eta] := n \cdot E[\nabla u] V[\delta\tilde\alpha] E[\nabla u]^T - n \cdot E[\nabla u] V[\delta\hat\alpha] E[\nabla u]^T$
      $= -\left(1 + \frac{n}{n'}\right) B V[\eta] B^T + B\,\mathrm{Cov}[\eta, u] + \mathrm{Cov}[u, \eta]\, B^T + o(1)$
      $= \frac{n'}{n+n'} E[\bar u \bar u^T] - \frac{n+n'}{n'} V\!\left[B\eta - \frac{n'}{n+n'} \bar u\right] + o(1)$.    (12)

In the second equality, we supposed that $n'/n$ converges to a positive constant. When
$\mathrm{Diff}[\eta]$ is positive definite, the estimator $\hat\alpha$ using the labeled and
unlabeled data improves the estimator $\tilde\alpha$ using only the labeled data. It is
straightforward to see that the improvement is not attained if $\bar u = 0$ holds. In general, the
score function $u(x, y; \alpha) = \nabla \log p(y|x; \alpha)$ satisfies $\bar u = 0$ if the model is
correctly specified. When the model of the conditional probability $p(y|x)$ is misspecified,
however, there is a possibility that the proposed estimator (11) outperforms the MLE $\tilde\alpha$.

We derive the optimal moment function $\eta$ for the estimation of the parameter $\alpha^*$. The
optimal $\eta$ can be different from (10). We prepare some notation. Let $\Pi_\phi \bar u$ be the
$\mathbb{R}^d$-valued function on $\mathcal{X}$, each element of which is the projection of the
corresponding element of $\bar u$ onto the subspace spanned by $\{\phi_1(x), \dots, \phi_r(x)\}$.
Here, the inner product is defined by the expectation under the marginal probability $p(x)$. In
addition, let $\Pi_\phi^\perp \bar u$ be the projection of $\bar u$ onto the orthogonal
complement of the subspace, i.e., $\Pi_\phi^\perp \bar u = \bar u - \Pi_\phi \bar u$.

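Theorem 1. We assume that the model of the density ratio is defined as

    $w(x; \theta) = \exp\{\phi(x)^T \theta\}$

with the basis functions $\phi(x) = (\phi_1(x), \dots, \phi_r(x))^T$ satisfying $\phi_1(x) = 1$.
Suppose that $E[\phi \phi^T] \in \mathbb{R}^{r \times r}$ is invertible, and that the rank of
$E[u \phi^T] E[\phi \phi^T]^{-1}$ is equal to the dimension of the parameter $\alpha$, i.e., it has
full row rank. We assume that the moment function $\eta(x; \theta)$ at $\theta = \theta^*$ is
represented as

    $\eta(x; \theta^*) = \phi(x) + \bar\phi(x)$,    (13)

where $\bar\phi(x)$ is a function orthogonal to $\phi(x)$, i.e., $E[\phi(x) \bar\phi(x)^T] = O$
holds. Then, an optimal $\bar\phi$ is given as

    $\bar\phi = \frac{n'}{n+n'}\, B^T (B B^T)^{-1}\, \Pi_\phi^\perp \bar u$.    (14)

For the optimal choice of $\eta$, the maximum improvement is given as

    $\mathrm{Diff}[\eta] = \frac{n'}{n+n'} E[\bar u \bar u^T] - \frac{n^2}{n'(n+n')} E[\Pi_\phi \bar u (\Pi_\phi \bar u)^T] + o(1)$
      $= \frac{n'}{n+n'} E[\Pi_\phi^\perp \bar u (\Pi_\phi^\perp \bar u)^T] + \frac{n'-n}{n'} E[\Pi_\phi \bar u (\Pi_\phi \bar u)^T] + o(1)$.    (15)

Proof. Due to $\phi_1(x) = 1$, one has $E[\bar\phi] = 0$ and
$E[\Pi_\phi^\perp \bar u] = E[1 \cdot \Pi_\phi^\perp \bar u] = 0$. Hence, one has
$E[\Pi_\phi \bar u] = E[\bar u] - E[\Pi_\phi^\perp \bar u] = 0$. Our goal is to find $\bar\phi$ which
minimizes $V[B\eta - \frac{n'}{n+n'} \bar u]$ in (12) in the sense of positive definiteness. The
orthogonal decomposition leads to

    $V\!\left[B\eta - \frac{n'}{n+n'} \bar u\right] = V\!\left[B\phi - \frac{n'}{n+n'} \Pi_\phi \bar u\right] + V\!\left[B\bar\phi - \frac{n'}{n+n'} \Pi_\phi^\perp \bar u\right]$,

because of the orthogonality between $B\phi - \frac{n'}{n+n'} \Pi_\phi \bar u$ and
$B\bar\phi - \frac{n'}{n+n'} \Pi_\phi^\perp \bar u$, and the equality
$E[B\bar\phi - \frac{n'}{n+n'} \Pi_\phi^\perp \bar u] = 0$. Hence, $\bar\phi$ satisfying

    $B\bar\phi - \frac{n'}{n+n'} \Pi_\phi^\perp \bar u = 0$

is an optimal choice. Since the matrix $B$ has full row rank, a solution of the above equation is
given by

    $\bar\phi = \frac{n'}{n+n'}\, B^T (B B^T)^{-1}\, \Pi_\phi^\perp \bar u$.

We obtain the maximum improvement of $\mathrm{Diff}[\eta]$ by using the equalities
$V[\Pi_\phi \bar u] = E[\Pi_\phi \bar u (\Pi_\phi \bar u)^T]$ and
$B\phi = E[u \phi^T] E[\phi \phi^T]^{-1} \phi = \Pi_\phi \bar u$. □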

Suppose that the optimal moment function $\eta = \phi + \bar\phi$ presented in Theorem 1 is used
with the score function $u(x, y; \alpha)$. Then, the improvement (15) is maximized when
$E[\Pi_\phi \bar u (\Pi_\phi \bar u)^T]$ is minimized. Hence, the model $w(x; \theta)$ with the
lower-dimensional parameter $\theta$ is preferable as long as the assumption in Theorem 1 is
satisfied. This is intuitively understandable, because the statistical perturbation of the
density-ratio estimator is minimized when the smallest model is employed.

Remark 1. Suppose that the basis functions, $\phi_1(x), \dots, \phi_r(x)$, are nearly orthogonal to
$\bar u$, i.e., $E[u \phi^T]$ is close to the null matrix. Then, the improvement
$\mathrm{Diff}[\eta]$ is close to $\frac{n'}{n+n'} E[\bar u \bar u^T]$. As a result, we have
$\sup_\phi \mathrm{Diff}[\eta] = \frac{n'}{n+n'} E[\bar u \bar u^T]$, in which the supremum is taken
over the bases of the density-ratio model satisfying the assumption in Theorem 1. However, basis
functions satisfying the exact equality $E[u \phi^T] = O$ are useless, because the equality
$E[u \phi^T] = O$ leads to $B = O$, and thus the equality (12) is reduced to

    $\mathrm{Diff}[\eta] = \frac{n'}{n+n'} E[\bar u \bar u^T] - \frac{n+n'}{n'} V\!\left[\frac{n'}{n+n'} \bar u\right] + o(1) = o(1)$.

This result implies that there is a singularity at the basis function $\phi$ such that
$E[u \phi^T] = O$.

It is not practical to apply the optimal function $\eta(x; \theta)$ defined by (14). The optimal
moment function depends on $\bar u$, and one needs information on the probability $p(y|x)$ to
obtain the explicit form of $\bar u$. The estimation of $\bar u$ needs non-parametric estimation,
since the model misspecification of $\mathcal{M}$ is significant in our setup. Thus, we consider a
more practical estimator for the density ratio. Suppose that $\bar\phi = 0$ holds for the moment
function $\eta(x; \theta^*)$. For example, the optimal moment function (10) satisfies
$\eta(x; \theta^*) = \frac{n}{n+n'} \phi(x)$ at $\theta = \theta^*$, i.e., $\bar\phi = 0$. For the
density-ratio model $w(x; \theta) = \exp\{\phi(x)^T \theta\}$ with $\phi_1(x) = 1$ and the moment
function satisfying $\eta(x; \theta^*) = \phi(x)$, a brief calculation yields that

    $\mathrm{Diff}[\eta] = \frac{n'-n}{n'} E[\Pi_\phi \bar u (\Pi_\phi \bar u)^T] + o(1)$.    (16)

Figure 1: The improvement $\mathrm{Diff}[\eta]$ is depicted as a function of the dimension of the
density-ratio model. Since the improvement is represented by a matrix, the figure gives a schematic
view of the inequality relation. When the dimension of $\theta$ tends to infinity and $n' > n$
holds, the two curves converge to the common positive definite matrix
$\frac{n'-n}{n'} E[\bar u \bar u^T]$.
Hence, the improvement is attained when $n < n'$ holds. As an interesting fact, we see that the
larger model $w(x; \theta)$ attains the better improvement in (16). Indeed, $\Pi_\phi \bar u$ gets
close to $\bar u$ when the density-ratio model $w(x; \theta) = \exp\{\theta^T \phi(x)\}$ becomes
large. Hence, non-parametric estimation of the density ratio may be a good choice to achieve a
large improvement for the estimation of the conditional probability. This is totally different
from the case in which the optimal $\bar\phi$ presented in Theorem 1 is used in the density-ratio
estimation. The relation between $\mathrm{Diff}[\eta]$ using the optimal $\bar\phi$ and
$\mathrm{Diff}[\eta]$ with $\bar\phi = 0$ is illustrated in Figure 1. In the limit of the
dimension of $\theta$, both variance matrices converge to $\frac{n'-n}{n'} E[\bar u \bar u^T]$
monotonically.

Example 1. Let $u(x, y; \alpha)$ be the score function of the model $y = \alpha^T b(x) + Z$,
$Z \sim N(0, \sigma^2)$, where $b(x) = (b_1(x), \dots, b_d(x))^T$ is the vector consisting of basis
functions and $\sigma$ is a known parameter. Then, one has
$u(x, y; \alpha) = (y - \alpha^T b(x))\, b(x)$. Suppose that the true conditional probability leads
to the regression function $y = f(x) + Z$, where $E[Z|x] = 0$ for all $x$. Then, one has
$\bar u(x; \alpha) = (f(x) - \alpha^T b(x))\, b(x)$ and
$E[\bar u \bar u^T] = E[(f(x) - \alpha^{*T} b(x))^2\, b(x) b(x)^T]$. Hence, the upper bound of the
improvement is governed by the degree of the model misspecification
$(f(x) - \alpha^{*T} b(x))^2$. According to Theorem 1, an optimal moment function
$\eta(x; \theta)$ is given as

    $\eta(x; \theta^*) = \phi(x) + \frac{n'}{n+n'}\, B^T (B B^T)^{-1} \left( (f(x) - \alpha^{*T} b(x))\, b(x) - B\phi(x) \right)$

at $\theta = \theta^*$, where $B = E[(f - \alpha^{*T} b)\, b\, \phi^T]\, E[\phi \phi^T]^{-1}$.
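
As a concrete one-dimensional instance of Example 1, the quantities of Theorem 1 can be evaluated in closed form. The following worked calculation is an illustration under assumptions not made in the text: $d = 1$, $b(x) = x$, $x \sim N(0, 1)$, $f(x) = x + \varepsilon x^2$, and the two-dimensional basis $\phi(x) = (1, x)^T$, for which $E[\phi\phi^T]$ is the identity matrix.

```latex
\begin{align*}
\alpha^* &= \frac{E[x f(x)]}{E[x^2]} = 1, \qquad
\bar u(x) = (f(x) - \alpha^* x)\,x = \varepsilon x^3, \qquad
E[\bar u^2] = \varepsilon^2 E[x^6] = 15\varepsilon^2, \\
B &= E[\bar u \phi^T]\,E[\phi\phi^T]^{-1} = (0,\ 3\varepsilon), \qquad
\Pi_\phi \bar u = B\phi = 3\varepsilon x, \qquad
\Pi_\phi^\perp \bar u = \varepsilon (x^3 - 3x), \\
E[(\Pi_\phi \bar u)^2] &= 9\varepsilon^2, \qquad
E[(\Pi_\phi^\perp \bar u)^2] = (15 - 18 + 9)\,\varepsilon^2 = 6\varepsilon^2, \\
\text{(16):}\quad \mathrm{Diff}[\eta] &= \frac{n'-n}{n'}\, 9\varepsilon^2 + o(1), \\
\text{(15):}\quad \mathrm{Diff}[\eta] &= \frac{n'}{n+n'}\, 6\varepsilon^2
  + \frac{n'-n}{n'}\, 9\varepsilon^2 + o(1).
\end{align*}
```

With $n = 500$ and $n' = 5000$, these evaluate to about $8.1\varepsilon^2$ for (16) and about $13.55\varepsilon^2$ for (15), the latter just below the bound $\frac{n'}{n+n'} \cdot 15\varepsilon^2 \approx 13.64\varepsilon^2$ of Remark 1.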
7 Numerical Experiments



We show numerical experiments to compare the standard supervised learning and the semi- 
supervised learning using DRESS. Both regression problems and classification problems are 
presented. 

7.1 Regression problems 

We consider the regression problem with the $d$-dimensional covariate shown below.

labeled data:

    $y_i = 1^T x_i + \varepsilon \|x_i\|^2 / d + z_i$, $z_i \sim N(0, \sigma^2)$, $i = 1, \dots, n$,    (17)
    $x_i \sim N_d(0, I_d)$, $1 = (1, \dots, 1)^T \in \mathbb{R}^d$.

unlabeled data: $x'_j \sim N_d(0, I_d)$, $j = 1, \dots, n'$.

regression model: $y = \alpha^T x + z$, $\alpha \in \mathbb{R}^d$, $z \sim N(0, \sigma^2)$.

score function: $u(x, y; \alpha) = (y - \alpha^T x)\, x$.

The parameter $\varepsilon$ in (17) controls the degree of the model misspecification. Let
$f_\varepsilon$ be the target function, $f_\varepsilon(x) = 1^T x + \varepsilon \|x\|^2 / d$, and
define

    $e(\varepsilon) = \min_{\alpha} E_x[\,|f_\varepsilon(x) - \alpha^T x|^2\,]$,

which is the squared distance from the true function to the linear regression model. On
the other hand, the mean square error of the naive least mean square (LMS) estimator
$\tilde\alpha$, i.e., $E_{\mathrm{Data}}[E_x[\,|f_0(x) - \tilde\alpha^T x|^2\,]]$, is asymptotically
equal to $\sigma^2 d / n$ when the model is specified. We use the ratio
$\delta$ = (model error)/(statistical error), formed from $e(\varepsilon)$ and $\sigma^2 d / n$,
as the normalized measure of the model misspecification. When $\delta$ exceeds 1, the
misspecification of the model can be statistically detected.
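
The data-generating process (17) and the two error scales above can be sketched as follows. This is an illustration, not the authors' experimental code: the function names are hypothetical, and the closed form used in `model_error` is derived here rather than stated in the text (for $x \sim N_d(0, I_d)$ the minimizer is $\alpha = 1$ and $E\|x\|^4 = d^2 + 2d$, so $e(\varepsilon) = \varepsilon^2 (1 + 2/d)$). The exact normalization of $\delta$ is not reproduced.

```python
import random


def f_eps(x, eps):
    """Target function f_eps(x) = 1^T x + eps * ||x||^2 / d from (17)."""
    d = len(x)
    return sum(x) + eps * sum(v * v for v in x) / d


def sample_regression_data(n, n_unlabeled, d, eps, sigma, rng):
    """Draw labeled pairs and unlabeled covariates with x ~ N_d(0, I_d)."""
    labeled_x = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    labeled_y = [f_eps(x, eps) + rng.gauss(0.0, sigma) for x in labeled_x]
    unlabeled_x = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n_unlabeled)]
    return labeled_x, labeled_y, unlabeled_x


def model_error(eps, d):
    """e(eps) = min_alpha E|f_eps(x) - alpha^T x|^2 (closed form derived here)."""
    return eps ** 2 * (1.0 + 2.0 / d)


def statistical_error(sigma, d, n):
    """Asymptotic mean square error sigma^2 * d / n of LMS under a specified model."""
    return sigma ** 2 * d / n


rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
lab_x, lab_y, unl_x = sample_regression_data(n=500, n_unlabeled=5000, d=2,
                                             eps=0.1, sigma=0.2, rng=rng)
```

The sample sizes and noise level in the usage line mirror the first experiment ($d = 2$, $n = 500$, $n' = 5000$, $\sigma = 0.2$); the value of `eps` is an arbitrary choice.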

First, we use a parametric model for density-ratio estimation. For any positive integer $k$, let
$x^{(k)}$ be the $d$-dimensional vector $(x_1^k, \dots, x_d^k)^T$. The density-ratio model is
defined as

    $w(x; \theta) = \exp\{\theta_0 + \theta_1^T x + \theta_2^T x^{(2)} + \cdots + \theta_L^T x^{(L)}\}$

having the $(Ld+1)$-dimensional parameter $(\theta_0, \theta_1, \dots, \theta_L)$. We apply the
estimator (10) presented by Qin (1998). Note that the estimator (10) satisfies $\bar\phi = 0$ at
$\theta = \theta^*$. Hence, the improvement is asymptotically given by (16). Under the setup of
$d = 2$, $n = 500$, $n' = 5000$ and $\sigma = 0.2$, we compute the mean square errors for the LMS
estimator $\tilde\alpha$ and DRESS $\hat\alpha$. The difference of test errors,

    $n \cdot \left( E[(\tilde\alpha^T x - f_\varepsilon(x))^2] - E[(\hat\alpha^T x - f_\varepsilon(x))^2] \right)$,

is evaluated for each $\varepsilon$ and each dimension of the density-ratio model, $Ld + 1$, where
the expectation is evaluated over the test samples. The mean square error is calculated by the
average over 500 iterations.

Figure 2: The difference of the mean square errors is plotted as a function of $\delta$, where
$\delta$ is the normalized measure of the model misspecification. The vertical axis "Improvement"
denotes the difference of the mean square errors between the LMS estimator and DRESS. Positive
improvement means that DRESS outperforms the LMS estimator.

Figure 2 shows the results. When the model is specified, i.e., $\delta = 0$ ($\varepsilon = 0$),
the LMS estimator performs better than DRESS. Under a practical setup such as $\delta > 1$,
however, we see that DRESS outperforms the LMS estimator. The dependency on the dimension of the
density-ratio model is not clearly detected in this experiment. Overall, larger density-ratio
models give rather unstable results. Indeed, in DRESS with a large density-ratio model, say
the bottom right panel in Figure 2, the mean square error of DRESS can be large, i.e., the
improvement is negative, even when the model misspecification $\delta$ is large.

Next, we compare the LMS estimator and DRESS with a nonparametric estimator of the density
ratio. Here, we use KuLSIF (Kanamori et al., 2012) as the density-ratio estimator. KuLSIF is a
non-parametric estimator of the density ratio based on the kernel method. The regularization is
efficiently conducted to suppress the degrees of freedom of the nonparametric model. In KuLSIF,
the kernel function of the reproducing kernel Hilbert space corresponds to the basis function.

[Figure 3 plot: three panels for $\sigma = 0.1$, $\sigma = 0.2$, $\sigma = 0.5$; horizontal axis: delta = (model error)/(statistical error); curves: naive estimator (LMS), DRESS (n'=20), DRESS (n'=100), DRESS (n'=1000).]

Figure 3: The square roots of the mean square errors of the naive estimator and DRESS with
$n' = 20, 100, 1000$ are depicted as functions of $\delta$, where $\delta$ is the normalized
measure of the model misspecification. The sample size of the labeled data is $n = 50$, and
$\sigma$ is the standard deviation of the noise involved in the dependent variable $y$.



Under the setup of d = 10, n = 50, n' = 20, 100, 1000 and σ = 0.1, 0.2, 0.5, we compute 
the mean square errors by averaging over 100 iterations. In Figure 3, the square roots of 
the mean square errors for the LMS estimator and DRESS are plotted as functions of δ, i.e., 
(model error)/(statistical error). When δ is around 1, it is statistically hard to detect the model 
misspecification from training data of size n = 50. When the model is correctly specified (ε = 0), 
the LMS estimator performs better than DRESS. Under a practical setup such as 
δ > 1, however, we see that DRESS with KuLSIF outperforms the LMS estimator. As shown in the 
asymptotic analysis, the sample size of the unlabeled data affects the estimation 
accuracy of DRESS. The numerical results show that DRESS with large n' attains a smaller 
error than DRESS with small n', especially when δ > 1 holds. In the numerical 
experiment, even DRESS with n = 50 and n' = 20 slightly outperforms the LMS estimator. This 
is not supported by the asymptotic analysis. Hence, a more involved theoretical study 
of the statistical properties of semi-supervised learning is needed. 
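
To illustrate the kind of regularized kernel least-squares fit on which density-ratio estimators of this family are built, the following is a minimal one-dimensional sketch. It is a simplified uLSIF-style estimator with a fixed set of Gaussian basis functions, not KuLSIF as defined by Kanamori et al. (2012) and not the code used in the experiments. The function names, the kernel width, the regularization constant lam, the choice of centers, and the toy Gaussian samples are all assumptions made for illustration.

```python
import math
import random

def gauss_kernel(a, b, width):
    """Gaussian kernel between two scalar points."""
    return math.exp(-((a - b) ** 2) / (2.0 * width ** 2))

def solve_linear(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = s / M[i][i]
    return z

def fit_density_ratio(x_nu, x_de, centers, width=1.0, lam=0.1):
    """Least-squares fit of r(x) ~ p_nu(x) / p_de(x) using kernel basis functions.

    Solves (H + lam * I) theta = h, where H averages products of basis
    functions over the denominator sample and h averages the basis
    functions over the numerator sample (hypothetical helper, not KuLSIF).
    """
    b = len(centers)
    feats_de = [[gauss_kernel(x, c, width) for c in centers] for x in x_de]
    H = [[sum(k[l] * k[m] for k in feats_de) / len(x_de) for m in range(b)]
         for l in range(b)]
    h = [sum(gauss_kernel(x, c, width) for x in x_nu) / len(x_nu) for c in centers]
    for l in range(b):
        H[l][l] += lam  # ridge regularization
    theta = solve_linear(H, h)

    def ratio(x):
        return sum(t * gauss_kernel(x, c, width) for t, c in zip(theta, centers))

    return ratio

# Toy data (assumed): the numerator sample is shifted to the right.
random.seed(0)
x_de = [random.gauss(0.0, 1.0) for _ in range(300)]
x_nu = [random.gauss(1.0, 1.0) for _ in range(300)]
ratio = fit_density_ratio(x_nu, x_de, centers=[-2.0, -1.0, 0.0, 1.0, 2.0])
print(ratio(-1.5), ratio(1.5))
```

In this toy setting the true ratio increases with x, so the fitted values should be larger on the right than on the left. The regularization constant controls how strongly the fit is shrunk, which is the role the text attributes to regularization in suppressing the degrees of freedom of the nonparametric model.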

7.2 Classification problems 

As a classification task, we use the spam dataset in the R package "kernlab" (Karatzoglou et al., 2004). 
The dataset includes 4601 samples. The dimension of the covariate is 57, i.e., x = (x_1, ..., x_57)^T, 
whose elements represent statistical features of each document. The output y is either 
"spam" or "nonspam". 

For the binary classification problem, we use the logistic model, 

    P(\mathrm{spam} \mid x; \alpha) = \frac{1}{1 + \exp\{-\alpha_0 - \sum_{d=1}^{D} \alpha_d x_d\}},

where D is the dimension of the covariate used in the logistic model. In the numerical experiments, 
D varies from 10 to 57; hence, the dimension of the model parameter α varies from 11 to 58. We 
tested DRESS with KuLSIF (Kanamori et al., 2012) and the MLE with n = 200, 500, 800 randomly 
chosen labeled training samples and n' = 100, 500, 1000, 2000 unlabeled training samples. The 
remaining samples serve as the test data. The score function u(x, y; α) = ∇ log P(y|x; α) 
is used for the estimation. 
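
For the logistic model above, the score function has the closed form (y - P(spam | x; α)) (1, x_1, ..., x_D) when y is coded as 1 for spam and 0 for nonspam. The sketch below writes this out in plain Python. The 0/1 coding, the function names, and the helper weighted_score_sum are assumptions for illustration; the weighted sum only indicates how per-sample weights, such as estimated density-ratio values, could enter an estimating equation, and it is not a reproduction of the DRESS estimator defined earlier in the paper.

```python
import math

def spam_probability(x, alpha):
    """P(spam | x; alpha) under the logistic model; alpha[0] is the intercept."""
    z = alpha[0] + sum(a * xd for a, xd in zip(alpha[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(x, y, alpha):
    """log P(y | x; alpha), with y = 1 for spam and y = 0 for nonspam (assumed coding)."""
    p = spam_probability(x, alpha)
    return math.log(p) if y == 1 else math.log(1.0 - p)

def score(x, y, alpha):
    """Gradient of log P(y | x; alpha) with respect to alpha."""
    residual = y - spam_probability(x, alpha)
    return [residual] + [residual * xd for xd in x]

def weighted_score_sum(samples, weights, alpha):
    """Sum of w_i * u(x_i, y_i; alpha); equal weights give the usual MLE score equation."""
    total = [0.0] * len(alpha)
    for (x, y), w in zip(samples, weights):
        for j, g in enumerate(score(x, y, alpha)):
            total[j] += w * g
    return total

# Small usage example with made-up numbers.
samples = [([0.5, -1.0], 1), ([1.5, 0.3], 0)]
print(weighted_score_sum(samples, [1.0, 1.0], [0.1, 0.2, -0.3]))
```

An estimator is then obtained by finding the α at which this (weighted) sum vanishes, for example with Newton's method or any generic root finder.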

Table 1 shows the prediction errors (%) with their standard deviations. We also show the 
p-values of the one-tailed paired t-test for the prediction errors of DRESS and MLE. Small p-values 
indicate the superiority of DRESS. We notice that the p-value is small when the dimension D is 
not large. In other words, the numerical results agree with the asymptotic theory in Section 6. For 
relatively high-dimensional models, the prediction error of MLE can be smaller than that of DRESS; 
see the rows for D = 57 in Table 1. The size of the unlabeled data, n', also affects the results. Indeed, 
the p-value becomes small for large n'. This result is supported by the asymptotic analysis 
presented in Section 6. 
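
As a reading aid for the p-values, here is a minimal sketch of a one-tailed paired t-test in which the alternative hypothesis is that the first method has the smaller mean error, so that small p-values favor the first method. It uses only the Python standard library and obtains the Student t distribution function by numerical integration. The function names and the toy error values are assumptions, and the test assumes the paired differences are not all identical.

```python
import math
import statistics

def t_cdf(t, df, steps=2000):
    """CDF of Student's t distribution, via Simpson's rule on the density."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    c = math.exp(log_c) / math.sqrt(df * math.pi)

    def pdf(s):
        return c * (1.0 + s * s / df) ** (-(df + 1) / 2)

    b = abs(t)
    h = b / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3.0
    return 0.5 + area if t >= 0 else 0.5 - area

def paired_one_tailed_p(errors_a, errors_b):
    """p-value for the alternative mean(errors_a) < mean(errors_b), paired runs."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t_cdf(t, n - 1)

# Toy prediction errors (%) over four paired runs (made-up numbers).
errors_a = [1.0, 2.0, 3.0, 4.0]
errors_b = [2.0, 3.0, 5.0, 5.0]
p_value = paired_one_tailed_p(errors_a, errors_b)
print(p_value)
```

Swapping the two arguments gives one minus the p-value, which is consistent with entries close to 1 in Table 1 appearing where MLE has the lower average error.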

Table 1: Prediction errors (%) for DRESS with KuLSIF and MLE are shown. The p-values of 
the one-tailed paired t-test for prediction errors are also presented. 





    n = 200, n' = 100                 n = 500, n' = 100                 n = 800, n' = 100
D   DRESS       MLE         p-value   DRESS       MLE         p-value   DRESS       MLE         p-value
10  21.48±0.95  21.69±1.09  0.023     20.86±0.76  20.93±0.82  0.163     20.73±0.71  20.72±0.67  0.541
20  18.81±1.30  18.54±1.42  0.987     17.15±0.79  17.16±0.90  0.424     16.63±0.68  16.93±0.87  0.000
30  14.67±1.54  14.44±1.50  0.993     11.83±0.75  11.92±0.83  0.056     11.29±0.51  11.39±0.56  0.057
40  16.16±1.79  16.06±1.83  0.910     12.18±0.81  12.19±0.84  0.410     11.24±0.60  11.40±0.62  0.005
50  15.98±2.49  15.84±2.45  0.939     11.41±0.94  11.25±0.94  0.988     10.13±0.66  10.15±0.65  0.359
57  15.08±2.64  15.01±2.67  0.777     10.83±1.06  10.59±0.90  1.000      9.07±0.61   8.98±0.70  0.959

    n = 200, n' = 500                 n = 500, n' = 500                 n = 800, n' = 500
D   DRESS       MLE         p-value   DRESS       MLE         p-value   DRESS       MLE         p-value
10  21.34±0.88  21.59±1.12  0.003     20.56±0.70  21.06±0.84  0.000     20.40±0.62  20.76±0.71  0.000
20  18.58±1.40  18.60±1.45  0.406     16.76±0.79  17.10±0.96  0.000     16.51±0.67  16.95±0.90  0.000
30  14.46±1.50  14.48±1.39  0.392     11.71±0.70  11.86±0.73  0.002     11.21±0.56  11.46±0.58  0.000
40  15.88±1.98  15.83±2.05  0.759     11.96±0.79  12.04±0.77  0.035     11.13±0.56  11.41±0.62  0.000
50  16.18±2.31  16.22±2.30  0.303     11.24±0.92  11.26±0.93  0.350     10.04±0.66  10.13±0.70  0.021
57  14.88±2.82  14.77±2.77  0.933     10.83±1.07  10.61±0.98  1.000      8.79±0.70   8.81±0.63  0.319

    n = 200, n' = 1000                n = 500, n' = 1000                n = 800, n' = 1000
D   DRESS       MLE         p-value   DRESS       MLE         p-value   DRESS       MLE         p-value
10  21.26±0.96  21.74±1.28  0.000     20.57±0.74  21.02±0.80  0.000     20.29±0.61  20.74±0.64  0.000
20  18.37±1.27  18.63±1.45  0.001     16.78±0.70  17.08±1.00  0.000     16.47±0.65  16.93±0.80  0.000
30  14.53±1.51  14.60±1.42  0.089     11.73±0.67  12.04±0.78  0.000     11.16±0.62  11.43±0.69  0.000
40  16.05±1.97  16.06±1.92  0.463     11.84±0.78  11.91±0.75  0.098     11.19±0.63  11.45±0.72  0.000
50  15.58±2.16  15.52±2.10  0.703     11.20±0.86  11.20±0.86  0.566      9.94±0.75  10.06±0.80  0.006
57  14.99±2.86  14.94±2.93  0.684     10.83±1.04  10.75±0.98  0.935      8.88±0.72   8.99±0.73  0.014

    n = 200, n' = 2000                n = 500, n' = 2000                n = 800, n' = 2000
D   DRESS       MLE         p-value   DRESS       MLE         p-value   DRESS       MLE         p-value
10  21.31±1.06  21.78±1.27  0.000     20.49±0.81  21.00±0.94  0.000     20.18±0.85  20.70±1.02  0.000
20  18.36±1.35  18.62±1.51  0.008     16.79±0.86  17.18±1.09  0.000     16.37±0.80  16.88±0.97  0.000
30  14.66±1.71  14.53±1.69  0.956     11.65±0.77  11.82±0.79  0.001     11.12±0.79  11.44±0.85  0.000
40  15.78±1.76  15.60±1.74  0.985     11.81±0.90  12.10±0.97  0.000     10.94±0.81  11.33±0.79  0.000
50  16.21±2.17  16.01±2.14  0.973     11.24±1.02  11.29±0.98  0.183     10.01±0.78  10.19±0.78  0.001
57  14.87±2.57  14.95±2.55  0.170     10.52±1.13  10.56±1.14  0.187      8.71±0.79   8.91±0.84  0.000

8 Conclusion 



In this paper, we investigated semi-supervised learning with a density-ratio estimator. We 
proved that the unlabeled data is useful when the model of the conditional probability p(y|x) 
is misspecified. This result agrees with that of Sokolovska et al. (2008), in which the weight 
function is estimated by using an estimator of the marginal probability 
p(x) under a specified model of p(x). The estimator proposed in this paper is useful in 
practice, since our method does not require a well-specified model for the marginal probability. 
Numerical experiments demonstrate the effectiveness of our method. We are currently investigating 
semi-supervised learning from the perspective of semiparametric inference with missing data. A 
positive use of the statistical paradox in semiparametric inference is an interesting direction for 
future work on semi-supervised learning. 